Tiffany Chan
Neural Network Assignment - Bank Churn Prediction
#Install tensorflow
!pip install tensorflow==2.0
#Let's import all our regular libraries
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline
#Let's import the tensorflow, and accompanying components
import tensorflow as tf
from tensorflow.keras import Sequential #For making neural network models
from tensorflow.keras.layers import Dense #For making the hidden layers
from tensorflow.keras.layers import Conv1D,Flatten #Imported for completeness; these layers are not used in the models below
1. Read the dataset
from google.colab import files
uploaded = files.upload()
import io
df = pd.read_csv(io.BytesIO(uploaded['bank.csv']))
#Let's look to see if there are any duplicates just in case.
from collections import Counter #Let's use this to check for duplicates
ids = df['CustomerId'] #Let's compare the customer IDs; using ids avoids shadowing the built-in id
d = Counter(ids)
res = [k for k, v in d.items() if v > 1]
print(res)
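As an aside, pandas has a more concise built-in check that gives the same answer (a small alternative sketch, not part of the original run):
#Equivalent duplicate check using pandas directly.
print(df['CustomerId'].duplicated().sum()) #Number of duplicated CustomerId values; 0 means no duplicates.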
There are no duplicates in the dataset. Let's proceed to drop the unique-identifier features.
2. Drop the columns which are unique for all users like IDs (5 points)
#Let's drop the unique identifier columns: CustomerID, Surname and RowNumber
df = df.drop("CustomerId" , axis=1)
df = df.drop("Surname", axis = 1)
df = df.drop("RowNumber", axis = 1)
df.head() #Checking to see if the unique identifiers are deleted.
#Descriptive Statistics
df.describe()
For the descriptive statistics above, it is most useful to look at the continuous variables, because summary statistics for the categorical variables are not meaningful. Age is clearly right-skewed: the mean is around 39, while the max is 92.
For the purpose of this exercise, examining these variables bivariately against the 'Exited' target variable may generate more insight.
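As a quick supplementary check of the two points above (a small sketch, not part of the original analysis):
#Confirm the right skew of Age and compare its distribution across the target classes.
print(df['Age'].skew()) #A positive value indicates right skew.
print(df.groupby('Exited')['Age'].describe()) #Bivariate summary of Age by churn status.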
#Finding out if there are any NAs and null values. This is to see if we need to impute any null values.
print(df.isnull().sum())
#Let's look at the dimensions of the data.
df.shape
#Let's look at what the dtype is for each feature.
df.info()
#Let's look at the frequencies for categorical variables (univariate analysis).
#Examining the categorical variables.
print("Geography Frequency")
print(df['Geography'].value_counts())
print("")
print("Gender Frequency")
print(df['Gender'].value_counts())
print("")
print("Tenure")
print(df['Tenure'].value_counts())
print("")
print("Number of Products Frequency")
print(df['NumOfProducts'].value_counts())
print("")
print("Has Credit Card Frequency")
print(df['HasCrCard'].value_counts())
print("")
print("Is An Active Member Frequency")
print(df['IsActiveMember'].value_counts())
print("")
print("Exited Frequency (Target Variable)")
df['Exited'].value_counts()
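To see the imbalance in the target variable as proportions rather than raw counts (a quick supplementary check, not in the original run):
#Proportion of customers in each target class.
print(df['Exited'].value_counts(normalize=True))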
After looking at these univariate frequencies for the categorical variables, it is clear that the data is very uneven. The target variable ('Exited') is perhaps the most problematic: the number of customers who stayed is more than three times (almost four times) the number who exited. For this reason, if we ran the neural network on the data as is, there may be a bias towards predicting that people stay.
The same trend appears in the HasCrCard variable: customers who have a credit card far outnumber those who do not.
The geography feature is also imbalanced: there are more than twice as many French customers as Spanish or German customers.
Tenure is not very evenly dispersed either.
In terms of the number of products customers hold, the split between those with 1 product and those with 2 products is even enough. On the other hand, only 266 customers had 3 products, and 60 customers had 4.
Gender and IsActiveMember are the only categorical variables that seem to be evenly distributed.
Some potential solutions to this issue:
Downsample (cut down the zeroes (0) in the target variable). Disadvantage: a lot of data is removed, and relevant information for predicting the target may be lost.
Upsample (synthetically create more ones (1) in the target variable; see the illustrative sketch after this discussion). Disadvantage: the model may overfit the training data and generalize poorly to the test data.
Use a cross-entropy loss (categorical_crossentropy / binary_crossentropy). Cross-entropy is a reasonably robust loss and can cope with datasets that are biased towards one group.
Since the assignment only asks for accuracy, it is fine to leave the data as is, because the accuracy metric relies mostly on predicting the majority class (in this case, the people who stayed with the bank). To judge how well the minority class (those who actually did exit) is predicted, examining precision and recall is important.
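For reference, a minimal sketch of what the upsampling option could look like using sklearn.utils.resample (illustrative only; it is not applied to the data used in the models below):
#Illustrative upsampling of the minority class (not used in the rest of this notebook).
from sklearn.utils import resample
majority = df[df['Exited'] == 0]
minority = df[df['Exited'] == 1]
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=0) #Sample with replacement until the classes match.
df_balanced = pd.concat([majority, minority_upsampled])
print(df_balanced['Exited'].value_counts()) #Both classes now have equal counts.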
3. Perform bivariate analysis and give your insights from the same (5 points)
sns.pairplot(data = df)
From the plots above, the most important are the ones where the predictor variables are plotted against the target variable ('Exited'), because they show whether there are strong individual predictors in the data. Looking at the pairplots, no single variable is particularly strong at predicting whether a customer will stay or leave, because the data points are fairly evenly distributed between those who left and those who stayed with the bank.
Despite this, there are some tentative conclusions we can draw: those with the lowest credit scores are more likely to leave; those in the higher age range are more likely to stay with the bank; and those with the highest balances (the leverage points) are more likely to leave.
Although many variables display crowded, band-like data in the plots, we should not rush to eliminate these features, because there may still be relationships with the target variable that are not immediately apparent.
When it comes to multicollinearity, neural networks are not strongly affected by it because of the complexity of their architecture. Neural networks are overparameterized, and each neuron forms its own linear combination of the previous layer's outputs. So multicollinearity, which mainly troubles simpler models and smaller sample sizes, has less influence on a neural network. For this reason, it may not be that important to look at Pearson correlations between the independent variables.
However, from these plots we can also observe leverage points. Leverage points are similar to outliers, but they are extremities that show up in bivariate plots. Even if we deleted outliers during univariate analysis, there may still be leverage points to handle.
For Age vs. Exited and Balance vs. Exited there clearly are extremities. In this case, we need to decide whether to delete these points or not. Since this is a neural network classification problem, we also need to consider the activation functions and which one(s) could help obtain better results.
Discussion of potential activation functions:
ReLU is an activation function that is easily influenced by outliers and extremities, because it passes large positive values through unchanged. So, if we want to work with ReLU, it may be better to delete the leverage points beforehand.
The sigmoid activation function is more robust because it squashes extreme values into the (0, 1) range. For this reason, sigmoid may be a better option if we choose not to delete those cases.
Softmax is also an activation function for classification models, but it is normally used in the output layer to convert the raw numerical outputs of the network into interpretable probabilities across multiple classes.
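To make the squashing point concrete, a tiny numerical sketch comparing how ReLU and sigmoid treat an extreme input (relies on the numpy and tensorflow imports at the top of the notebook; the value is a made-up example):
#Compare how ReLU and sigmoid handle an extreme (outlier-like) value.
extreme = np.array([250000.0]) #e.g. a very large, unscaled account balance (illustrative value).
print(tf.keras.activations.relu(extreme).numpy()) #ReLU passes the extreme value through unchanged.
print(tf.keras.activations.sigmoid(extreme).numpy()) #Sigmoid squashes it to (essentially) 1.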
4. Distinguish the feature and target set and divide the data set into training and test sets (5 points)
#Let's divide the dataset into predictor variables and the target variable.
X = df.drop("Exited", axis = 1) # This X data will be all the predictor variables. So, 'Exited' must be removed.
y= df.iloc[:,-1].values # This is for the target variable 'Exited', which is the last column in the df dataframe. It needs to be made into an array.
#Double check to see if exited is removed from X
X.head()
#Get Dummies for the geography categorical variable. We don't need to get dummies for the other categorical variables because they are mostly binary.
#As for tenure, it can be interpreted as a continuous variable. So, we don't really have to alter any other variable except geography.
X = pd.get_dummies(X, columns=['Geography'])
#We should use a label encoder for those variables like Gender where the categories should be converted to numbers.
from sklearn.preprocessing import LabelEncoder
for col in X.select_dtypes(include=['object']):
    encoder = LabelEncoder()
    X[col] = encoder.fit_transform(X[col])
#Dividing into train and test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
#Let's make a copy of the X_train data so that we can run one X_train on our initial model and another one on the improved model.
X_train_init = X_train.copy() #Both start out the same. X_train_init is used for the preliminary model and X_train for the final model after tuning the hyperparameters.
y_train_init = y_train.copy() #Same idea for the target: y_train_init for the preliminary model, y_train for the final model.
5. Normalize the train and test data (10 points)
#Normalize the train data. Let's just normalize the train data for the initial model first.
#We do not know whether using normalize or standardscaler would be better.
#We will normalize/standardize the test data later based on whichever way yields better results.
import pandas as pd
from sklearn import preprocessing
X_train_init = preprocessing.normalize(X_train_init) #Normalize the initial model train data.
#Make sure we understand the shape of the data so that we know what to put into the input layer for dimensions.
print(X_train_init.shape)
print(y_train_init.shape)
6. Initialize & build the model. Identify the points of improvement and implement the same. (20)
#Build the neural network framework for the initial model.
from tensorflow import keras
from tensorflow.keras.layers import Dense,Conv1D,Flatten
from tensorflow.keras.models import Sequential, Model
model = Sequential()
#Let's add the layers (input, hidden and output)
model.add(Dense(100, input_shape = (12,))) #First layer taking the 12 input features; no activation is specified, so Keras defaults to a linear activation.
model.add(Dense(300, activation='sigmoid')) #Let's use either one of the more common activation functions here, either ReLu or Sigmoid. Since we did not remove outliers/leverage points, we should use sigmoid because it has squashing capabilities.
model.add(Dense(1,activation='softmax')) #Since this is a classification model, let's use softmax as activation function for this output layer.
#Designate the optimizer, the metrics used for evaluation and the loss function.
model.compile(optimizer='adam',metrics=['accuracy'],loss='binary_crossentropy') # I tried to use categorical_crossentropy here but kept getting an error. So, I changed it to binary_crossentropy because this is a binary classification problem afterall.
#Training the model. Use a fairly large batch size first, and set the number of epochs to how many passes you want the data to make through the model.
#We should also use a validation split to check that we are not overfitting, which is very important.
model.fit(X_train_init,y_train_init,batch_size=100,validation_split=0.1,epochs=100)
This was the first model I ran for this assignment.
As you can tell, the accuracy score is very low (approx. 0.2) for the training data. I ran many models in order to improve it. I learned that softmax is used more frequently for multi-class classification.
Changing the activation function in the output layer:
Softmax converts the raw outputs into probabilities that sum to 1 across classes. With a single output neuron, softmax therefore always outputs 1, so the network cannot distinguish between the classes; that is why the initial accuracy was so poor. The sigmoid activation function maps the single output to a probability between 0 and 1 and yielded a far better result, improving accuracy by roughly 0.6 in absolute terms.
Changing normalization to StandardScaler:
Switching from preprocessing.normalize to StandardScaler improved the accuracy of the model. A likely reason is that preprocessing.normalize rescales each row to unit length, which destroys the relative scale of the features, whereas StandardScaler standardizes each feature (column) to zero mean and unit variance (see the sketch after this list).
Increasing the epochs:
After running many models on the training data and validation set, I realized that increasing the epochs improves training accuracy but risks overfitting, because the validation accuracy does not improve along with the training accuracy.
Adding layers and nodes:
Adding a hidden layer and more nodes to the hidden layers improved the predictive accuracy.
Upsampling/Downsampling:
There were also attempts to upsample and downsample the target classes to see if it would improve the accuracy of the model. Both only brought the model's accuracy to approximately 0.5. Because these techniques shift the emphasis to the minority class, accuracy is unlikely to improve drastically through upsampling/downsampling alone. As a result, they were set aside in favour of tuning the network's hyperparameters, which produced a more noteworthy improvement in the accuracy metric.
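To illustrate the normalize vs. StandardScaler difference mentioned above, a small sketch on a made-up two-row example (not the bank data):
#Why StandardScaler behaved differently from preprocessing.normalize (toy example).
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
toy = np.array([[35.0, 60000.0], [45.0, 120000.0]]) #Two rows with an Age-like and a Balance-like column.
print(preprocessing.normalize(toy)) #Row-wise: each row is scaled to unit length, so the small Age column is dwarfed by the Balance column.
print(StandardScaler().fit_transform(toy)) #Column-wise: each feature gets zero mean and unit variance.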
IMPROVED MODEL
#We are changing from normalizer to StandardScaler because the results were better. This is also answering question 5 about normalizing the train and test data.
from sklearn.preprocessing import MinMaxScaler,StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Make sure we understand the shape of the data so that we know what to put into the input layer for dimensions.
print(X_train.shape)
print(y_train.shape)
# Implementing a new framework for the model and adding the necessary layers for the improved model.
model=Sequential()
model.add(Dense(180,input_shape=(12,))) # 12 input dimensions.
model.add(Dense(100,activation='sigmoid'))
model.add(Dense(1,activation='sigmoid')) #Sigmoid activation function because this is a binary classification problem.
#Change the loss function from the initial model to binary_crossentropy because this is a binary classification problem.
model.compile(optimizer='adam',metrics=['accuracy'],loss='binary_crossentropy')
#Train the model, and use the validation split to guard against overfitting.
model.fit(X_train,y_train,batch_size=60,validation_split=0.1,epochs=100)
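If overfitting with a high epoch count is a concern, as noted in the list of improvements above, one option (not used in the run above; a hedged sketch) is Keras's EarlyStopping callback:
#Optional: a callback that stops training once validation accuracy stops improving (illustrative; not used in this notebook).
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=10, restore_best_weights=True)
#It would be passed to the training call as: model.fit(..., validation_split=0.1, callbacks=[early_stop])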
One can see that the accuracy of the model on the training data improved to about 0.9, and the loss was small. The validation accuracy is 0.87, so the model is performing well and is not badly overfit.
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print('Accuracy: %.3f' % acc)
print('Loss: %.3f' % loss)
The accuracy on the test data is close to the training and validation scores, so the model generalizes reasonably well.
7. Predict the results using 0.5 as a threshold (10 points)
#Predicting results with a 0.5 threshold.
pred=model.predict(X_test)
pred=np.where(pred>0.5,1,0) #Setting the threshold to 0.5
pred #Display the thresholded predictions.
8. Print the Accuracy score and confusion matrix (5 points)
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, precision_recall_curve
accuracy_score(y_test,pred)
The accuracy score on the test data ended up being 0.86, with a loss of 0.36, which is fairly strong. Comparing this with the training and validation results, we can see that the model is not badly overfit: training accuracy was 0.90 (loss 0.25), validation accuracy was 0.87 (loss 0.36), and the test accuracy did not fall far below those.
#Let's plot the confusion matrix and other related metrics for the test data.
#Please scroll all the way down to see the metrics and confusion matrix for this model!
y_pred_cls = model.predict_classes(X_test, batch_size= 200, verbose = 0) #predict_classes is available in TF 2.0; it is equivalent to thresholding model.predict at 0.5.
print('Accuracy (test): ' + str(model.evaluate(X_test,y_test)[1]))
print('Recall_Score: ' + str(recall_score(y_test, y_pred_cls)))
print('Precision_Score: ' + str(precision_score(y_test, y_pred_cls)))
print('F-score: ' + str(f1_score(y_test, y_pred_cls)))
confusion_matrix(y_test, y_pred_cls)
#Making confusion matrix look better.
confmatrix=pd.DataFrame(confusion_matrix(y_test,pred))
confmatrix.index=['Actual_0','Actual_1']
confmatrix.columns=['Predicted_0','Predicted_1']
confmatrix
Accuracy counts all correctly predicted cases, and here its high value is mostly due to customers who were predicted to stay and actually stayed. However, it is not the best metric for understanding misclassification.
Precision and recall are better metrics for determining how great this model is in predicting customers' churning behavior because the minority class is the highlight of these metrics.
Precision = TP / (TP + FP): 69% of those who were predicted to leave actually left.
Recall = TP / (TP + FN): 56% of those who actually left were predicted to leave.
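These two formulas can be cross-checked directly against the confusion matrix above (a small sketch using the confmatrix DataFrame already built):
#Recompute precision and recall from the confusion matrix cells.
tp = confmatrix.loc['Actual_1', 'Predicted_1'] #Customers who left and were predicted to leave.
fp = confmatrix.loc['Actual_0', 'Predicted_1'] #Customers predicted to leave who actually stayed.
fn = confmatrix.loc['Actual_1', 'Predicted_0'] #Customers who left but were predicted to stay.
print('Precision: %.2f' % (tp / (tp + fp)))
print('Recall: %.2f' % (tp / (tp + fn)))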
These metrics show that the model is not especially good at capturing churning behavior, despite its high accuracy on the customers who do not leave. Still, considering that we used activation functions with strong squashing properties to handle outliers, and a loss function that copes with the uneven data without upsampling or downsampling the target variable, the neural network performed reasonably well.
The neural network performs well because of the components and the careful details that go into the model. However, with bank data like this, more traditional machine learning models such as random forest, gradient boosting, or even logistic regression may be competitive solutions for predicting banking behavior and business earning potential.
Neural networks tend to shine more in domains like image recognition than in traditional tabular banking problems.
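As a point of comparison for that claim, a hedged sketch of a simple logistic regression baseline on the same standardized train/test split (not part of the original assignment; results would need to be run to compare):
#Illustrative logistic regression baseline on the same standardized data.
from sklearn.linear_model import LogisticRegression
logit = LogisticRegression(max_iter=1000, random_state=0)
logit.fit(X_train, y_train)
print('Logistic regression test accuracy: %.3f' % logit.score(X_test, y_test))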